
GH-35730: [C++] Add the ability to specify custom schema on a dataset write #35860

Conversation

@westonpace (Member) commented May 31, 2023

Rationale for this change

The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was added to support saving schema metadata. However, it doesn't capture field metadata or field nullability. This PR replaces that capability with the ability to specify a custom schema instead. The custom schema must have the same number of fields as the input to the write node and each field must have the same type.
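
For illustration, a minimal Python sketch of how this capability surfaces through pyarrow's write_dataset (the table, path, and schema here are hypothetical; only the schema= argument reflects what this PR adds):

    # Minimal sketch; the table and path are hypothetical.
    import pyarrow as pa
    import pyarrow.dataset as ds

    # A custom schema: same field count and types as the data, but carrying
    # nullability and field metadata -- what custom_metadata could not express.
    custom_schema = pa.schema(
        [pa.field("x", pa.int64(), nullable=False, metadata={"unit": "meters"})])
    table = pa.table({"x": [1, 2, 3]}, schema=custom_schema)
    ds.write_dataset(table, "/tmp/example", schema=custom_schema, format="parquet")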

What changes are included in this PR?

Added custom_schema to WriteNodeOptions and removed custom_metadata.

Are these changes tested?

Yes, I added a new C++ unit test to verify that the custom info is applied to written files.

Are there any user-facing changes?

No breaking changes; only new functionality (which is user facing).

@github-actions (bot)

⚠️ GitHub issue #35730 has been automatically assigned in GitHub to PR creator.

@westonpace (Member, Author)

Starting in draft stage, as I believe we should get some Python unit tests in here; @anjakefala volunteered to write some.

@anjakefala (Collaborator) commented May 31, 2023

So far, I added 2 basic tests, based on my understanding of the feature!

The basic case where you write a single table, which contains a field with nullability specified, passes.

Note that this one:

    # we can specify the nullability of a field through the schema
    pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)
    dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
    assert dataset.to_table().schema.equals(schema_nullable)

is failing for now. I did not specify nullability in the table's schema, but then specified it in write_dataset(schema=schema_nullable).

Is it expected that the returned dataset would have a field with nullability?

In this example, `table` has a field with nullability specified, while `table_no_null` does not:

pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable) 

The resulting schema also does not have nullability.
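
The thread does not show the test fixtures; for context, a guess at what they plausibly look like (these exact definitions are assumptions):

    # Hypothetical reconstruction of the fixtures above. Despite its name,
    # schema_nullable marks "x" as not null, matching the error shown below.
    import pyarrow as pa

    schema_nullable = pa.schema([pa.field("x", pa.int64(), nullable=False),
                                 pa.field("y", pa.int64())])
    table = pa.table({"x": [1, 2], "y": [3, 4]}, schema=schema_nullable)
    table_no_null = pa.table({"x": [1, 2], "y": [3, 4]})  # both fields nullable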

@anjakefala (Collaborator)

I have confirmed that

 pa.dataset.write_dataset(table, tempdir/"nulltest2", schema=schema_nullable, format="parquet") 

has a field with nullability.

The following do not:

pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")  

or

pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet") 

We can, however, create an InMemoryDataset with mixed field metadata, so I created a test for that.

@westonpace (Member, Author)

The following do not:

pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")

or

pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet")

These lines failed for me with the following error:

pyarrow/dataset.py:936: in write_dataset
    data = InMemoryDataset(data, schema=schema)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise ArrowTypeError(
E   pyarrow.lib.ArrowTypeError: Item has schema
E   x: int64
E   y: int64
E   which does not match expected schema
E   x: int64 not null
E   y: int64

I thought this was supported and it took me a moment to track down what was going on. The error is actually being raised before the C++ call to write the dataset. Pyarrow is taking the two inputs (table, table_no_null) and trying to put them in an InMemoryDataset and specifying the schema. The constructor for InMemoryDataset is verifying that all the tables it has been given have the same schema and throwing an error because it was given a table whose schema does not match the dataset's schema.

If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it, at least for this PR. (Arguably, you could evolve a table into the correct schema when adding it to an InMemoryDataset, but that's a different feature.)

This is kind of confusing because @anjakefala and I were testing earlier and found that you are allowed to create an InMemoryDataset from tables/batches that have the same types and nullability but different field metadata. So I created an additional Python test case for field metadata, which does verify the "two tables but mixed metadata can be overridden by an explicit schema" call.
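
A sketch of the distinction, using hypothetical fixtures (the allowed case is the one the new test covers; the rejected case is the invalid scenario above):

    # Hypothetical sketch of what the InMemoryDataset constructor accepts.
    import pyarrow as pa
    import pyarrow.dataset as ds

    plain = pa.schema([pa.field("x", pa.int64())])
    tagged = pa.schema([pa.field("x", pa.int64(), metadata={"doc": "tagged"})])
    t1 = pa.table({"x": [1]}, schema=plain)
    t2 = pa.table({"x": [2]}, schema=tagged)

    # Allowed: same types and nullability, only field metadata differs.
    ds.InMemoryDataset([t1, t2], schema=tagged)

    # Rejected: nullability differs from the dataset schema.
    not_null = pa.schema([pa.field("x", pa.int64(), nullable=False)])
    try:
        ds.InMemoryDataset([t1, t2], schema=not_null)
    except pa.lib.ArrowTypeError:
        pass  # "Item has schema ... which does not match expected schema"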

@westonpace westonpace marked this pull request as ready for review June 1, 2023 05:03
@westonpace westonpace requested a review from AlenkaF as a code owner June 1, 2023 05:03
@westonpace (Member, Author)

Oh, and thank you for adding the test case :)

@anjakefala (Collaborator)

If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it (at least for this PR. Arguably, you could evolve a table into the correct schema if adding it to an InMemoryDataset but that's a different feature).

Good to know!

@jorisvandenbossche (Member) left a comment

Looks good to me!

I just have one question about whether we should also validate the names of the schema; the rest are minor code style nitpicks in the test code, for which I can also push some changes.

  }
  for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
    if (!input_schema->field(field_idx)->type()->Equals(
            custom_schema->field(field_idx)->type())) {
@jorisvandenbossche (Member):

Should we also test that the names of the fields are equal?

@westonpace (Member, Author):

Changing the names should be safe. Admittedly, a user could also do this name change by inserting a project node before the write node.

I could be convinced otherwise, but I don't think this does any harm, and as a user I would expect this behavior, so the name change wouldn't be surprising.

@github-actions bot added the awaiting merge label and removed the awaiting committer review label Jun 1, 2023

if (custom_schema != nullptr) {
  if (custom_schema->num_fields() != input_schema->num_fields()) {
    return Status::Invalid(
Member:

Status::TypeError here?

@westonpace (Member, Author):

Switched. Although I will confess I'm not entirely clear on when to use one over the other. In my mind TypeError is only for cases where a dynamic_cast fails.

Member:

Yes, to be honest, for a wrong number of fields I would also raise a ValueError/Invalid (it's a wrong value, not a wrong type).

Member:

A schema error is a type error IMHO.
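
For context on how the two statuses differ for Python users (this is pyarrow's standard exception mapping, not part of this diff):

    # Status::Invalid surfaces as ArrowInvalid (a ValueError subclass);
    # Status::TypeError surfaces as ArrowTypeError (a TypeError subclass).
    import pyarrow as pa

    assert issubclass(pa.ArrowInvalid, ValueError)
    assert issubclass(pa.ArrowTypeError, TypeError)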

for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
  if (!input_schema->field(field_idx)->type()->Equals(
          custom_schema->field(field_idx)->type())) {
    return Status::Invalid("The provided custom_schema specified type ",
Member:

Same here.

@westonpace (Member, Author):

Switched.


/// \brief Options to control how to write the dataset
FileSystemDatasetWriteOptions write_options;
/// \brief Optional metadata to attach to written batches
std::shared_ptr<const KeyValueMetadata> custom_metadata;
Member:

Instead of removing this option (as a breaking change), we could in theory still allow the user to specify one or both?

(I am not using the C++ API for this, so I don't know how useful this would be / how cumbersome it is to specify the schema if you only want to specify metadata. From the DatasetWriter point of view, this is a fine change of course since there we already have the full schema)

@westonpace (Member, Author):

If it were a new feature I would argue it's not worth it (a user could technically use DeclarationToSchema to get the output schema of the plan leading up to the write and then attach custom metadata to that). However, given we have already released custom_metadata, and I would like Acero's API to start being stable, I suppose I should set an example. Thanks for the nudge. I have restored custom_metadata.
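
From the Python side, a hedged sketch of why custom_schema alone can cover the metadata use case (names and path hypothetical):

    # Key/value metadata can ride along on the custom schema itself.
    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"x": [1, 2, 3]})
    schema_with_md = table.schema.with_metadata({"origin": "sensor-42"})
    ds.write_dataset(table, "/tmp/with_md", schema=schema_with_md, format="parquet")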

@github-actions bot added the awaiting changes label and removed the awaiting merge label Jun 1, 2023
@github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Jun 1, 2023
@github-actions bot added the awaiting change review label and removed the awaiting changes label Jun 1, 2023
@westonpace (Member, Author)

The failing test on AMD64 is unrelated (flight ucx test). There is still one R test running but I think this is good to go now.

@@ -34,6 +34,7 @@
 #include "arrow/filesystem/filesystem.h"
 #include "arrow/io/file.h"
 #include "arrow/util/compression.h"
+#include "arrow/util/key_value_metadata.h"
Member:

This shouldn't be necessary if arrow/type_fwd.h is included.

@westonpace (Member, Author):

I removed this include and explicitly included arrow/type_fwd.h.

@pitrou (Member) commented Jun 1, 2023

@jorisvandenbossche Do you want to take a look at the Python test changes?

@github-actions bot added the awaiting changes label and removed the awaiting change review label Jun 1, 2023
@anjakefala (Collaborator) left a comment

I like the Python test changes!

Weston took out the invalid test case, and added one that covers how the field metadata gets applied.

@jorisvandenbossche (Member)

Do you want to take a look at the Python test changes?

There were no changes to this since my previous review (and push), so all good!

@westonpace (Member, Author)

I'm going to go ahead and merge if green. I think it's quite late for Joris and I believe Raul will be starting the patch build soon. If Joris finds an issue later then we can still scrap the patch release candidate for a fix (though hopefully we don't need to).

@github-actions bot added the awaiting merge label and removed the awaiting changes label Jun 1, 2023
@westonpace westonpace merged commit 018e7d3 into apache:main Jun 1, 2023
@westonpace westonpace added this to the 12.0.1 milestone Jun 1, 2023
raulcd pushed a commit that referenced this pull request Jun 1, 2023
… write (#35860)


* Closes: #35730

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
@raulcd (Member) commented Jun 2, 2023

@github-actions crossbow submit test-r-ubuntu-22.04

@github-actions (bot) commented Jun 2, 2023

Revision: 875dfed

Submitted crossbow builds: ursacomputing/crossbow @ actions-42981f190e

Task status: test-r-ubuntu-22.04 on GitHub Actions

@ursabot commented Jun 3, 2023

Benchmark runs are scheduled for baseline = 3fe4a31 and contender = 018e7d3. 018e7d3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.8% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.33% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.6% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 018e7d3f ec2-t3-xlarge-us-east-2
[Finished] 018e7d3f test-mac-arm
[Finished] 018e7d3f ursa-i9-9960x
[Finished] 018e7d3f ursa-thinkcentre-m75q
[Finished] 3fe4a315 ec2-t3-xlarge-us-east-2
[Finished] 3fe4a315 test-mac-arm
[Finished] 3fe4a315 ursa-i9-9960x
[Finished] 3fe4a315 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Jun 4, 2023
…ataset write (apache#35860)


Successfully merging this pull request may close these issues.

[Python] write_dataset does not preserve non-nullable columns in schema